Grow Data Skills

Capgemini | Data Engineer Interview Experience | 6 YoE

Tirthankar Sarkar
30-Jul-2025
5 mins read

Round 1: Technical Screening (SQL + Pyspark)

↳ Given below 2 tables:

How many records will come up upon performing - inner join, outer join, left join, right join?

↳ Given an employees file having columns:

employee_id, employee_name, employee_salary, employee_dept, employee_manager

- Write a custom schema to read this file, given that separator is '|

- First 4 lines in the file are blank, 5th line is header. How would you read it?

- Suppose the values for column employee_name is of the format -> "Himani H Sharma". Derive three columns from it - first_name, middle_name, last_name

- Find all employees who have salaries greater than their manager's salary.

- Find the third highest salary in the whole table, and then per department.

↳ What's the difference between persist() & cache()?

↳ What's the difference between map() & flatMap()?

↳ What kind of AWS services have you used at your job?

Round 2: Client Interview Round (Pyspark + SQL)

↳ What kind of projects have you worked on so far? Describe the architecture along with full/incremental load strategies, and how data validation is performed.

↳ You are working on a data pipeline that processes user activity logs from a mobile app. The logs are stored in JSON format in a data lake (e.g., S3 or ADLS). Each record contains the following fields:

- user_id (string)

- event_type (string: e.g., "login", "purchase", "logout")

- event_timestamp (timestamp)

- device_type (string: e.g., "iOS", "Android")

1. Read the JSON file and filter out all the "purchase" event types.

2. For each user, find out the difference between maximum & minimum purchase timestamps.

↳ Below table has users & mobile numbers:

↳ Write Spark Program to generate count of distinct mobile numbers for each user.

Expected output:

↳ Given tables - Customers (id, name) & Orders (orderid, customer id, amount)

Return top 3 customers by total spending along with names and total spent.

Interviewer instruction: Use CTE, join & window function to solve the problem.

↳ Given a table having below cricket teams:

Find out all possible combinations of match fixtures without any repetitions.

For example - Ind vs NZ & NZ vs Ind are considered to be the same match fixture. Only one combination of this should be in the output.

↳ In windowing functions, which clause is mandatory to have? partition by or order by?

↳ How would you determine the number of jobs, tasks & stages given a sequence of transformations like - reading a file, applying union transformation, group by followed by writing the output to a table.

↳ How would 10 GB of data be processed in spark?

↳ Is read() an action or transformation?

↳ Are you aware of SCD? Describe SCD-1, SCD-2 & SCD-3, and whether you've implemented them at your job.

Round 3: HR round

↳ Salary negotiation, job location preference discussion.

⭐ Cracked all rounds, got the job offer!

Complete Data Engineering With Azure - Basic To Advance

Admission Open

By Grow Data Skills

Enroll Now

Complete Data Engineering With AWS - Basic To Advance

Admission Open

By Grow Data Skills

Enroll Now